🟪 Today’s data: Ramen Ratings

Read the Ramen Ratings dataset by The Ramen Rater: https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-06-04

ramen<- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-04/ramen_ratings.csv")
head(ramen)
str(ramen)
## 'data.frame':    3180 obs. of  6 variables:
##  $ review_number: int  3180 3179 3178 3177 3176 3175 3174 3173 3172 3171 ...
##  $ brand        : chr  "Yum Yum" "Nagatanien" "Acecook" "Maison de Coree" ...
##  $ variety      : chr  "Tem Tem Tom Yum Moo Deng" "tom Yum Kung Rice Vermicelli" "Kelp Broth Shio Ramen" "Ramen Gout Coco Poulet" ...
##  $ style        : chr  "Cup" "Pack" "Cup" "Cup" ...
##  $ country      : chr  "Thailand" "Japan" "Japan" "France" ...
##  $ stars        : num  3.75 2 2.5 3.75 5 3.5 3.75 5 3.5 4.25 ...

🟪 Objective: Find the rating distributions for countries with more than 200 products.

Pre-processing

  • What countries are there?
  • How many products does each of them have?
  • Missing values?

What countries?
library(dplyr)
ramen$country%>%table()
## .
##     Australia    Bangladesh        Brazil      Cambodia        Canada 
##            25            11            12             5            48 
##         China      Colombia         Dubai       Estonia          Fiji 
##           207             6             3             2             4 
##       Finland        France       Germany         Ghana       Holland 
##             3             4            28             2             4 
##     Hong Kong       Hungary         India     Indonesia         Italy 
##           155             9            41           150             3 
##         Japan      Malaysia        Mexico       Myanmar         Nepal 
##           532           182            28            14            14 
##   Netherlands   New Zealand       Nigeria      Pakistan   Philippines 
##            16             1             2             9            49 
##    Phlippines        Poland        Russia       Sarawak     Singapore 
##             1             6             3             5           134 
##   South Korea        Sweden        Taiwan      Thailand            UK 
##           357             3           330           205            69 
##       Ukraine United States           USA       Vietnam 
##             3           382             1           112
sort(table(ramen$country), decreasing = TRUE)
## 
##         Japan United States   South Korea        Taiwan         China 
##           532           382           357           330           207 
##      Thailand      Malaysia     Hong Kong     Indonesia     Singapore 
##           205           182           155           150           134 
##       Vietnam            UK   Philippines        Canada         India 
##           112            69            49            48            41 
##       Germany        Mexico     Australia   Netherlands       Myanmar 
##            28            28            25            16            14 
##         Nepal        Brazil    Bangladesh       Hungary      Pakistan 
##            14            12            11             9             9 
##      Colombia        Poland      Cambodia       Sarawak          Fiji 
##             6             6             5             5             4 
##        France       Holland         Dubai       Finland         Italy 
##             4             4             3             3             3 
##        Russia        Sweden       Ukraine       Estonia         Ghana 
##             3             3             3             2             2 
##       Nigeria   New Zealand    Phlippines           USA 
##             2             1             1             1

Wait, Holland, Netherlands, United States, USA? Let’s keep them in mind and check for NAs first.

ramen%>%is.na()%>%table
## .
## FALSE  TRUE 
## 19065    15

Print out the rows with NA values.

ramen[!complete.cases(ramen),]
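
To see which columns the missing values come from, a per-column count can help. Below is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative, not taken from the ramen data):

```r
# Toy data frame with a few missing values (illustrative only)
toy <- data.frame(
  brand = c("A", "B", "C", "D"),
  style = c("Cup", NA, "Pack", "Bowl"),
  stars = c(3.5, 4, NA, 5),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Rows that contain at least one NA
incomplete_rows <- toy[!complete.cases(toy), ]
incomplete_rows
```

The same two calls applied to `ramen` would show whether the NAs sit in `stars`, `style`, or elsewhere.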

Netherlands and Holland will be removed later anyway, since their combined count (16 + 4 = 20) is less than 200.

However, the single USA row is complete and belongs with United States, so let’s recode it as United States.

ramen$country[ramen$country=="USA"]<- "United States"
ramen$country %>% table() %>% sort(decreasing = TRUE)
## .
##         Japan United States   South Korea        Taiwan         China 
##           532           383           357           330           207 
##      Thailand      Malaysia     Hong Kong     Indonesia     Singapore 
##           205           182           155           150           134 
##       Vietnam            UK   Philippines        Canada         India 
##           112            69            49            48            41 
##       Germany        Mexico     Australia   Netherlands       Myanmar 
##            28            28            25            16            14 
##         Nepal        Brazil    Bangladesh       Hungary      Pakistan 
##            14            12            11             9             9 
##      Colombia        Poland      Cambodia       Sarawak          Fiji 
##             6             6             5             5             4 
##        France       Holland         Dubai       Finland         Italy 
##             4             4             3             3             3 
##        Russia        Sweden       Ukraine       Estonia         Ghana 
##             3             3             3             2             2 
##       Nigeria   New Zealand    Phlippines 
##             2             1             1
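
The same recoding idea extends to several spelling variants at once. The sketch below uses a named lookup vector. The `aliases` mapping is an assumption for illustration (whether Holland should be merged into Netherlands, for instance, is a judgment call), and `country` is a made-up vector rather than the real column.

```r
# Made-up country labels containing spelling variants
country <- c("USA", "United States", "Holland", "Netherlands", "Phlippines", "Japan")

# Hypothetical mapping: variant -> canonical name
aliases <- c(
  "USA"        = "United States",
  "Holland"    = "Netherlands",
  "Phlippines" = "Philippines"
)

# Replace only the values that appear in the lookup table
is_alias <- country %in% names(aliases)
country[is_alias] <- unname(aliases[country[is_alias]])

table(country)
```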

According to the data description, ratings are given in steps of 0.25. Let’s look at the rows whose stars value is not a multiple of 0.25, then remove them.

ramen %>% filter(stars%%0.25 != 0)
ramen %>% filter(stars%%0.25 != 0) %>% select(country) %>% table()
## country
##         China         Japan      Malaysia   South Korea        Taiwan 
##             1             1             1             7             5 
##      Thailand United States       Vietnam 
##             6             1             1
ramen <- ramen %>% filter(stars%%0.25 == 0)
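
Two details of this filter matter here. Multiples of 0.25 are represented exactly as floating-point numbers, so the `%%` test is reliable for this grid. Also, `filter()` drops rows where the condition evaluates to `NA`, so rows with a missing rating disappear as well. A small sketch with made-up values (`toy` and `kept` are illustrative names):

```r
library(dplyr)

# Made-up ratings: two valid, one off-grid, one missing
toy <- data.frame(
  id    = 1:4,
  stars = c(3.75, 5, 3.1, NA)
)

# The condition is TRUE, TRUE, FALSE, NA
toy$stars %% 0.25 == 0

# filter() keeps only rows where the condition is TRUE,
# so both the off-grid row and the NA row are removed
kept <- toy %>% filter(stars %% 0.25 == 0)
kept
```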

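Now subset the data, keeping only the countries with more than 200 products. The count is taken after the previous filter, so Thailand (205 products, 6 of them just removed) no longer qualifies.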

ramen200<- ramen[complete.cases(ramen),] %>%
  group_by(country) %>% mutate(productCount = n()) %>% 
  filter(productCount>200)
head(ramen200, 10)
ramen200%>%is.na()%>%table
## .
## FALSE 
## 12467
ramen200$country %>% table() %>% sort(decreasing = TRUE)
## .
##         Japan United States   South Korea        Taiwan         China 
##           528           374           348           325           206
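
An equivalent way to write this subset is with `add_count()`, which adds the group size without leaving the data grouped. This is a sketch on made-up data with a threshold of 2 instead of 200; `toy` and `toy_big` are illustrative names.

```r
library(dplyr)

toy <- data.frame(
  country = c("Japan", "Japan", "Japan", "China", "China", "Fiji"),
  stars   = c(5, 4.5, 3.75, 3.5, 4, 2)
)

# add_count() appends the number of rows per country as productCount
toy_big <- toy %>%
  add_count(country, name = "productCount") %>%
  filter(productCount > 2)

toy_big
```

Only the three Japan rows survive, since China has 2 rows and Fiji has 1.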

Calculate the numbers for distributions

Check the summary of the data first.

summary(ramen200)
##  review_number     brand             variety             style          
##  Min.   :   1   Length:1781        Length:1781        Length:1781       
##  1st Qu.: 747   Class :character   Class :character   Class :character  
##  Median :1594   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1612                                                           
##  3rd Qu.:2535                                                           
##  Max.   :3179                                                           
##    country              stars        productCount  
##  Length:1781        Min.   :0.000   Min.   :206.0  
##  Class :character   1st Qu.:3.500   1st Qu.:325.0  
##  Mode  :character   Median :3.750   Median :374.0  
##                     Mean   :3.743   Mean   :386.2  
##                     3rd Qu.:4.500   3rd Qu.:528.0  
##                     Max.   :5.000   Max.   :528.0
ramen200_summ <- ramen200 %>%
  group_by(country, stars) %>%
  summarise(ratingCount = n()) %>%
  mutate(ratingRatio = ratingCount / sum(ratingCount))
ramen200_summ
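
After `summarise()`, the result is still grouped by `country` (only the last grouping variable, `stars`, is dropped), which is why `sum(ratingCount)` in the `mutate()` step is a per-country total. Below is a sketch on made-up data that checks the ratios add up to 1 within each country; the `.groups` argument is spelled out here only to make the grouping explicit.

```r
library(dplyr)

toy <- data.frame(
  country = c("Japan", "Japan", "Japan", "China", "China"),
  stars   = c(5, 5, 4, 3.5, 4)
)

toy_summ <- toy %>%
  group_by(country, stars) %>%
  summarise(ratingCount = n(), .groups = "drop_last") %>%  # still grouped by country
  mutate(ratingRatio = ratingCount / sum(ratingCount))

toy_summ

# Each country's ratios should sum to 1
totals <- toy_summ %>% summarise(total = sum(ratingRatio))
totals
```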

Or, try tapply to show simple statistics for each country

tapply(values, index, function): apply function to values separately for each group in index
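
A minimal sketch of how `tapply()` splits and summarises, using made-up vectors (`values` and `index` here are illustrative):

```r
values <- c(5, 4, 3, 2, 4)
index  <- c("Japan", "Japan", "China", "China", "China")

# Mean of values within each group of index
group_means <- tapply(values, index, mean)
group_means
```

The result has one element per distinct value of `index`, named after the group.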

tapply(ramen200$stars, ramen200$country, summary)
## $China
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.250   3.750   3.473   4.188   5.000 
## 
## $Japan
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.500   4.000   3.913   4.750   5.000 
## 
## $`South Korea`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.500   3.750   3.843   4.250   5.000 
## 
## $Taiwan
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.250   4.000   3.801   5.000   5.000 
## 
## $`United States`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   3.750   3.509   4.188   5.000
tapply(ramen200$stars, ramen200$country, table)
## $China
## 
##    0 0.25  0.5    1 1.25  1.5 1.75    2 2.25  2.5 2.75    3 3.25  3.5 3.75    4 
##    6    3    2    3    1    4    4    3    2    3    8   11   11   26   41   26 
## 4.25  4.5 4.75    5 
##   18   14    2   18 
## 
## $Japan
## 
##    0 0.25  0.5 0.75    1 1.25  1.5 1.75    2 2.25  2.5 2.75    3 3.25  3.5 3.75 
##    4    3    2    1    6    1    5    1    7    2   12   15   21   20   69   63 
##    4 4.25  4.5 4.75    5 
##   62   39   58   32  105 
## 
## $`South Korea`
## 
##    0  0.5    1 1.75    2 2.25  2.5 2.75    3 3.25  3.5 3.75    4 4.25  4.5 4.75 
##    2    1    2    1    9    3    7    7   11   19   55   65   59   24   20   10 
##    5 
##   53 
## 
## $Taiwan
## 
##    0 0.25  0.5    1 1.25  1.5 1.75    2 2.25  2.5 2.75    3 3.25  3.5 3.75    4 
##    4    1    1    6    4    6    4    7    3    7    8    9   26   28   45   34 
## 4.25  4.5 4.75    5 
##   23   16    7   86 
## 
## $`United States`
## 
##    0 0.25  0.5    1 1.25  1.5 1.75    2 2.25  2.5 2.75    3 3.25  3.5 3.75    4 
##    8    3    1    2    1    9    6   15    5   10   16   21   30   53   55   45 
## 4.25  4.5 4.75    5 
##   23   19    7   45

Well…?


Visualize the information

Role of data visualization and its impact.

The Datasaurus Dozen. While different in appearance, each dataset has the same summary statistics (mean, standard deviation, and Pearson’s correlation) to two decimal places.


To help your audience understand your data effectively, you should choose an appropriate chart type.


plot(x, y, main="title", xlab="x-axis label", ylab="y-axis label"): scatter plot.

  • Both x and y should be numeric.
  • Given a single vector, its values are plotted on the y-axis against their index, so you can see the distribution of individual values.
  • Given x and y together, you can see the relation between them.
plot(ramen200$stars)

plot(ramen200$stars, main="Distribution of Ratings: all 5 countries", ylab="Rating")
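
When only one numeric vector is supplied, the values go on the y-axis and their positions (1, 2, 3, ...) on the x-axis. A small sketch with made-up numbers; `xy.coords()` is the grDevices helper that `plot()` uses to resolve its coordinates, and `v` is an illustrative vector:

```r
v <- c(3, 1, 2)

# With a single vector, x becomes the index and y the values
coords <- xy.coords(v)
coords$x
coords$y

# So these two calls place the points identically
plot(v)
plot(seq_along(v), v)
```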


barplot(height, main="title", xlab="x-axis label", ylab="y-axis label"): bar chart

  • Draws one bar per element of height; each value of height is the bar's y value.

With a formula: barplot(formula, data, main="title", xlab="x-axis label", ylab="y-axis label")

barplot(ramen200$stars)

barplot(table(ramen200$stars), main="Distribution of Ratings: all 5 countries", ylab="Frequency", xlab="Rating")

barplot(ratingCount ~ stars + country, data = ramen200_summ)
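
With two variables on the right-hand side of the formula, the bars are laid out by star value within each country. One way to see the shape of the data behind such a chart is to build a stars-by-country table explicitly with `xtabs()`. This is a sketch on made-up summary data: `toy_summ` and its numbers are illustrative, and `beside = TRUE` is a choice made here, not something the call above uses.

```r
toy_summ <- data.frame(
  country     = c("Japan", "Japan", "China", "China"),
  stars       = c(4, 5, 4, 5),
  ratingCount = c(10, 20, 6, 3)
)

# Rows are star values, columns are countries
m <- xtabs(ratingCount ~ stars + country, data = toy_summ)
m

# One group of bars per country, one bar per star value
barplot(m, beside = TRUE, legend.text = TRUE,
        xlab = "Country", ylab = "Frequency")
```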


hist(x, main="title", xlab="x-axis label", ylab="y-axis label"): histogram

  • Displays one bar per bin of x values; the frequency of x in each bin is the y value.
par(mfrow = c(1, 3)) 
hist(ramen200$stars, main = "Histogram")
barplot(ramen200$stars, main = "Barplot")
barplot(table(ramen200$stars), main = "Barplot: table")

par(mfrow = c(1, 2)) 
hist(ramen200$stars, main = "Histogram", breaks = seq(0,5,by=0.25))
barplot(table(ramen200$stars), main = "Barplot: table")
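
One caveat with `breaks = seq(0, 5, by = 0.25)`: by default `hist()` uses right-closed bins and includes the lowest break in the first bin, so ratings 0 and 0.25 share the first bar, whereas `barplot(table(...))` gives each distinct rating its own bar. The sketch below, on made-up ratings, shows the difference and one possible fix (breaks shifted by half a step so each rating sits in the middle of its own bin). The names `x`, `h1`, and `h2` are illustrative.

```r
x <- c(0, 0.25, 0.25, 0.5, 5)

# Default behaviour: bins are [0, 0.25], (0.25, 0.5], ..., (4.75, 5]
h1 <- hist(x, breaks = seq(0, 5, by = 0.25), plot = FALSE)
h1$counts[1:2]   # 0 and 0.25 are counted together in the first bin

# Shifted breaks: one bin per possible rating
h2 <- hist(x, breaks = seq(-0.125, 5.125, by = 0.25), plot = FALSE)
h2$counts[1:3]

# Compare with the frequency table used by barplot()
table(x)
```

With the shifted breaks there are 21 bins, one for each rating from 0 to 5, so the histogram and the bar plot of the table line up.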


pie(x, labels = names(x), main="title"): pie chart

  • The values in the numeric vector x are displayed in order. If the values have names, each slice is labeled with the corresponding name; otherwise the slices are numbered from 1.
pie(ramen200$stars)

pie(ramen200$stars[1:10])

pie(table(ramen200$stars[1:10]))

pie(table(ramen200$stars), main="Distribution of Ratings: all 5 countries")


What is the main message of our data? What should we emphasize?

What about these?
library(ggplot2)
ggplot(ramen200[order(ramen200$stars),], aes(x = country, y = stars, fill = stars)) + 
  geom_bar(stat = "identity") +
  scale_fill_gradient2(low = "green", high = "red", mid = "yellow", midpoint = 2.5) +
  labs(
    title = "Distribution of Ratings by Country",
    x = "Country",
    y = "Rating")

#install.packages("ggbeeswarm")
library(ggbeeswarm)
ggplot(ramen200, aes(x = country, y = stars, color = country)) +
  geom_beeswarm(cex = 0.25 , alpha=0.8, show.legend = FALSE) +
  labs(
    title = "Distribution of Ratings by Country",
    x = "Country",
    y = "Rating"
  )